Selective noise/channel/coding models and recognizers for automatic speech recognition

ABSTRACT

An apparatus and method for the robust recognition of speech during a call in a noisy environment are presented. Specific background noise models are created to model various background noises which may interfere with the error-free recognition of speech. These background noise models are then used to determine which noise characteristics a particular call has. Once the background noise in a given call has been determined, speech recognition is carried out using the appropriate background noise model.

FIELD OF THE INVENTION

The present invention relates to the robust recognition of speech in noisy environments using specific noise environment models and recognizers, and more particularly, to selective noise/channel/coding models and recognizers for automatic speech recognition.

BACKGROUND INFORMATION

Many of the speech recognition applications in current use have difficulty properly recognizing speech in a noisy background environment. Or, if speech recognition applications work well in one noisy background environment, they may not work well in another. That is, when a speaker is speaking into a pick-up microphone/telephone with a background that is filled with extraneous noise, the speech recognition application may incorrectly recognize the speech and is thus prone to error. Thus time and effort are wasted by the speaker and the goals of the speech recognition applications are often not achieved. In telephone applications it is often necessary for a human operator to have the speaker repeat what has been previously spoken or to attempt to decipher what has been recorded.

Thus, there has been a need for speech recognition applications to be able to correctly assess what has been spoken in a noisy background environment. U.S. Pat. No. 5,148,489, issued Sep. 15, 1992 to Erell et al., relates to the preprocessing of noisy speech to minimize the likelihood of errors. The speech is preprocessed by calculating for each vector of speech in the presence of noise an estimate of clean speech. Calculations are accomplished by what is called minimum-mean-log-spectral distance estimations using mixture models and Markov models. However, the preprocessing calculations rely on the basic assumptions that the clean speech can be modeled because the speech and noise are uncorrelated. As this basic assumption may not be true in all cases, errors may still occur.

U.S. Pat. No. 4,933,973, issued Jun. 12, 1990 to Porter, relates to the recognition of incoming speech signals in noise. Pre-stored templates of noise-free speech are modified to have the estimated spectral values of noise and the same signal-to-noise ratio as the incoming signal. Once modified, the templates are compared within a processor by a recognition algorithm. Thus recognition is dependent upon proper modification of the noise-free templates. If modification is incorrectly carried out, errors may still be present in the speech recognition.

U.S. Pat. No. 4,720,802, issued Jan. 19, 1988 to Damoulakis et al., relates to a noise compensation arrangement. Speech recognition is carried out by extracting an estimate of the background noise during unknown speech input. The noise estimate is then used to modify pre-stored noiseless speech reference signals for comparison with the unknown speech input. The comparison is accomplished by averaging values and generating sets of probability density signals. Correct recognition of the unknown speech thus relies upon the proper estimation of the background noise and proper selection of the speech reference signals. Improper estimation and selection may cause errors to occur in the speech recognition.

Thus, as can be seen, the industry has not yet provided a system of robust speech recognition which can function effectively in various noisy backgrounds.

SUMMARY OF THE INVENTION

In response to the above noted and other deficiencies, the present invention provides a method and an apparatus for robust speech recognition in various noisy environments. Thus the speech recognition system of the present invention is capable of higher performance than currently known methods in both noisy and other environments. Additionally, the present invention provides noise models, created to handle specific background noises, which can quickly be determined to relate to the background noise of a specific call.

To achieve the foregoing, and in accordance with the purposes of the present invention, as embodied and broadly described herein, the present invention is directed to the robust recognition of speech in noisy environments using specific noise environment models and recognizers. Thus models of various noise environments are created to handle specific background noises. A real-time system then analyzes the background noise of an incoming call, loads the appropriate noise model and performs the speech recognition task with the model.

The background noise models, themselves, are created for each set of background noise which may be used. Examples of the background noises to be sampled as models would be: city noise, motor vehicle noise, truck noise, airport noise, subway train noise, cellular interference noise, etc. Obviously, the models need not only be limited to simple background noise. For instance, various models may model different channel conditions, different telephone microphone characteristics, various different cellular coding techniques, Internet connections, and other noises associated with the placement of a call wherein speech recognition is to be used. Further, a complete set of sub-word models can be created for each characteristic by mixing different background noise characteristics.

Actual creation and collection of the models can be accomplished in any known manner, or any manner hereafter to be known, as long as the noise sampled can be loaded into a speech recognizer. For instance, models can be created by recording background noise and clean speech separately and later combining the two. Or, models can be created by recording speech with the various background noise environments present. Or even further, for example, the models can be created using signal processing of recorded speech to alter it as if it had been recorded in the noisy background.

Which model to use is determined by the speech recognition apparatus. At the beginning of a call, a sample of the surrounding background environment from where the call is being placed is recorded. As introductory prompts, or other such messages, are being played to the caller, the system analyzes the recorded background noise. Different methods of analysis may be used. Once the appropriate noise model has been chosen on the basis of the analysis, speech recognition is performed with the model. The system can also constantly monitor the speech recognition function, and if it is determined that speech recognition is not at an acceptable level, the system can replace the chosen model with another.

The present invention and its features and advantages will become more apparent from the following detailed description with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a speech recognition apparatus for the creation, storage and use of various background noise models, according to an embodiment of the present invention.

FIG. 2 illustrates a flow chart for determination of the proper noise model to use, according to an embodiment of the present invention.

FIG. 3 illustrates a flow chart for robust speech recognition and, if necessary, model replacement, according to an embodiment of the present invention.

FIG. 4 illustrates a chart of an example of the selection of an appropriate background noise model to be used in the speech recognition application, according to an embodiment of the present invention.

DETAILED DESCRIPTION

FIGS. 1 to 4 show a speech recognition apparatus and method for robust speech recognition in noisy environments according to an embodiment of the present invention. A hidden Markov model is created to model a specific background noise. When a call is placed, background noise is recorded and analyzed to determine which Markov model is most appropriate to use. Speech recognition is then carried out using the appropriately determined model. If speech recognition is not being performed at an acceptable level, the model may be replaced by another.

Referring to FIG. 1, various background noises 1, . . . , n, n+1 are recorded using known sound collection devices, such as pick-up microphones 1, . . . , n, n+1. It is to be understood, of course, that any collection technique, whether known or hereafter to be known, may be used. The various background noises which can be recorded are sounds such as: city noise, traffic noise, airport noise, subway train noise, cellular interference noise, different channel characteristics noise, various different cellular coding techniques noise, Internet connection noise, etc. Of course, the various individual background characteristics may also be mixed in infinite variations. For example, cellular channel characteristics noise may be mixed with background traffic noise. It is to be understood, of course, that other, more varied background noise may also be recorded, that what is to be recorded is not to be limited, and that any means sufficient for the recordation and/or storage of sound may be used.

The recorded background noise is then modeled to create hidden Markov models for use in speech recognizers. Modeling is performed in the modeling device 10 using known modeling techniques. In this embodiment, the recorded background noise and pre-labeled speech data are put through algorithms which pick out phonemes creating, in essence, statistical background noise models. As described in this embodiment then, the models are thus created by recording background noise and clean speech separately and later combining the two.
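The description does not specify how separately recorded noise and clean speech are combined. One common possibility, offered here only as an illustrative sketch, is to add the noise to the speech at a chosen signal-to-noise ratio before training. The function name `mix_at_snr`, the list-of-samples representation and the `snr_db` parameter are assumptions made for this example and do not appear in the original disclosure.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to clean speech so the mix has the requested SNR (in dB).

    Hypothetical helper: samples are plain floats, and the noise recording
    is repeated as needed to cover the whole utterance.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise recording so it is as long as the speech.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in tiled) / len(tiled)
    if noise_power == 0:
        return list(speech)  # silent noise: nothing to add
    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db/10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]

# Example: equal speech and noise power (0 dB SNR).
noisy = mix_at_snr([1.0, -1.0, 1.0, -1.0], [0.5, 0.5], snr_db=0)
```

The resulting noisy utterances, paired with the existing labels of the clean speech, could then be passed to whatever model training procedure the modeling device 10 uses.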

Of course, it is to be recognized that any method capable of creating noise models which can be uploaded into a speech recognizer can be used in the present invention. For instance, models can be created by recording speech with the various background noise environments present. Or, for example, the models can be created using signal processing of the recorded speech to alter it as if it had been recorded in the noisy background.

The modeled background noise is then stored in an appropriate storage device 20. The storage device 20 itself may be located at a central network hub, or it may be reproduced and distributed locally. The various stored background noise models 1, . . . , n, n+1 are then appropriately accessed from the storage device 20 by a speech recognition unit 30 when a call is placed by the telephone user 40. There may, of course, be more than one speech recognition unit 30 used for any given call. Further, the present invention will work equally well with any technique of speech recognition using the background noise models.

Referring to FIG. 2, a call is placed by a user and received by the telephone company in steps 100 and 110, respectively. It is to be recognized, of course, that although the preferred embodiment described herein is in the context of the receipt of a simple telephone call, the present invention will work equally well with any speech transmission technique used and thus is not to be limited to the one embodiment. Once the connection has been made, in step 120, approximately 2 seconds' worth of background noise at the caller's location is recorded and/or monitored. Of course, various lengths of time may be used based upon adequate reception and other factors. Introductory messages, instructions or the like are then played in step 125. While these messages are being played, the background noise recorded in step 120 is analyzed by the system in step 130. Even while the messages are being played to the caller, the known technique of echo cancellation may be used to record and/or monitor further background noise. In explanation, the system will effectively cancel out the messages being played in the recording and/or monitoring of the background noise.

Analysis of the background noise may be accomplished in one or more ways. Signal information, such as the type of signals (ANI, DNIS, SS7 signals, etc.), channel port number, or trunk line number may be used to help restrict what the background noise is, and thus what background noise model would be most suitable. For example, the system may determine that a call received over a particular trunk line number may more likely than not be from India, as that trunk line number is the designated trunk for receiving calls from India. Further, the location of the call may be recognized by the caller's account number, time the call is placed or other known information about the caller and/or the call. Such information could be used as a preliminary indicator of the existence and type of background noise.
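As a rough illustration of how such signal information might narrow the choice of models, the following sketch filters a list of candidate models using call metadata. The field names (`"trunk"`, `"hour"`), the model names and the `hints` table are all hypothetical; the description does not define a data format for this step.

```python
def restrict_candidates(all_models, call_info, hints):
    """Narrow the candidate noise models using call metadata.

    hints maps a (field, value) pair, such as ("trunk", 7), to the model
    names considered likely for calls with that property (assumed format).
    """
    candidates = set(all_models)
    for field, value in call_info.items():
        likely = hints.get((field, value))
        if likely:
            narrowed = candidates & set(likely)
            if narrowed:  # ignore a hint that would rule out every model
                candidates = narrowed
    # Preserve the original ordering of the model list.
    return [m for m in all_models if m in candidates]

models = ["city", "airport", "subway", "cellular"]
hints = {
    ("trunk", 7): ["city", "cellular"],  # hypothetical trunk line number
    ("hour", 8): ["city", "subway"],     # hypothetical time-of-call hint
}
shortlist = restrict_candidates(models, {"trunk": 7, "hour": 8}, hints)
```

Because the information is only a preliminary indicator, the sketch never narrows the list to nothing; a hint that conflicts with the remaining candidates is simply skipped.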

Alternatively, or in conjunction with the preceding method, a series of questions or instructions to be posed to the caller with corresponding answers to be made by the caller may be used. These answers may then be analyzed using each model (or a pre-determined maximum number of models) to determine which models have a higher correct match percentage. For example, the system may carry on a dialog with the caller and instruct the caller to say “NS437W”, “Boston”, and “July 1st”. The system will then analyze each response using the various background noise models. The model(s) with the correct match for each response by the caller can then be used in the speech recognition application. An illustration of the above analysis method is found in FIG. 4. As can be seen, the analysis of the first response “NS437W” is correctly matched by models 2, 4 and n. However, only models 2 and n correctly matched the second response, and only model n matched all three responses correctly. Thus model n would be chosen for the following speech recognition application.
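The correct match percentage described above can be sketched as a simple tally. The recognizer outputs below are invented for illustration; only the pattern of matches for the first two responses and the final outcome for model n follow the FIG. 4 example, and the function names are assumptions.

```python
def correct_match_rates(expected, recognized_by_model):
    """Return, per model, the fraction of prompted responses recognized correctly."""
    if not expected:
        raise ValueError("at least one expected response is required")
    rates = {}
    for model, outputs in recognized_by_model.items():
        hits = sum(1 for want, got in zip(expected, outputs) if want == got)
        rates[model] = hits / len(expected)
    return rates

def best_models(rates):
    """Return every model that achieves the highest match rate."""
    top = max(rates.values())
    return [model for model, rate in rates.items() if rate == top]

expected = ["NS437W", "Boston", "July 1st"]
# Hypothetical recognizer outputs for each background noise model.
recognized = {
    "model 1": ["MS437W", "Austin", "July 1st"],
    "model 2": ["NS437W", "Boston", "June 1st"],
    "model 4": ["NS437W", "Austin", "June 1st"],
    "model n": ["NS437W", "Boston", "July 1st"],
}
rates = correct_match_rates(expected, recognized)
chosen = best_models(rates)  # only "model n" matches all three responses
```

Returning a list rather than a single name leaves room for the case, discussed next, in which more than one model performs equally well.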

Also, if the system is unable to definitively decide which model and/or models yield the best performance in the speech recognition application, the system may either guess, use more than one model by using more than one speech recognizer, or compare parameters of the call's recorded background noise to parameters contained in each background noise model.
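The description does not say which parameters are compared or how. As one hedged possibility, each model could be summarized by a short numeric vector and the recorded noise matched to the nearest one by Euclidean distance. The three-element vectors below (imagined here as band energies) and the name `closest_model` are assumptions for illustration only.

```python
import math

def closest_model(noise_params, model_params):
    """Pick the model whose stored parameter vector is nearest the recorded noise."""
    def distance(a, b):
        # Euclidean distance between two equal-length parameter vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(model_params, key=lambda name: distance(noise_params, model_params[name]))

# Hypothetical parameter vectors stored with each background noise model.
stored = {
    "city": [0.8, 0.3, 0.1],
    "subway": [0.2, 0.9, 0.4],
}
match = closest_model([0.7, 0.4, 0.1], stored)
```

Any other similarity measure, including a likelihood computed from the hidden Markov models themselves, could be substituted for the distance function.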

Once a call from a particular location has been matched to a background noise model, the system can store that information in a database. Thus in step 135, a database of which background noise models are most successful in the proper analysis of the call's background noise can be created and stored. This database can later be accessed when another incoming call is received from the same location. For example, it may previously have been determined, and stored in the database, that a call from a particular location should use the city noise background noise model in the speech recognition application, because that model results in the highest percentage of correct speech recognitions. Thus the most appropriate model is used. Of course, the system can dynamically update itself by constantly re-analyzing the call's recorded background noise to detect potential changes in the background noise environment.
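A minimal sketch of such a database, assuming an in-memory table keyed by a location identifier, is shown below. The class name `ModelHistory`, its methods and the example location strings are hypothetical; a deployed system would presumably use persistent storage.

```python
class ModelHistory:
    """Track, per calling location, how often each noise model recognized correctly."""

    def __init__(self):
        # location -> {model name: [correct recognitions, total recognitions]}
        self._stats = {}

    def record(self, location, model, correct):
        """Store the outcome of one recognition attempt."""
        per_model = self._stats.setdefault(location, {})
        counts = per_model.setdefault(model, [0, 0])
        if correct:
            counts[0] += 1
        counts[1] += 1

    def best_for(self, location):
        """Return the model with the highest correct percentage, or None if unseen."""
        per_model = self._stats.get(location)
        if not per_model:
            return None
        return max(per_model, key=lambda m: per_model[m][0] / per_model[m][1])

history = ModelHistory()
history.record("location A", "city", True)
history.record("location A", "city", True)
history.record("location A", "airport", True)
history.record("location A", "airport", False)
preferred = history.best_for("location A")  # "city": 2 of 2 correct
```

Recording every outcome, rather than only the final choice, is what lets the stored preference change when the background noise at a location changes.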

Once the call's recorded background noise has been analyzed, or the database has been accessed to determine where the call is coming from and which model is most appropriate, in step 140 the most appropriate background noise model is selected and recalled from the storage device 20. Further, alternative background noise models may be ordered on a standby basis in case speech recognition fails with the selected model. With the most appropriate background noise model having been selected, and other models ordered on standby, the system proceeds in step 150 to the speech recognition application using the selected model.

Referring to FIG. 3, in step 160 the selected background noise model is loaded into the speech recognition unit 30. Here speech recognition is performed using the chosen model. There is more than one method by which the speech recognition can be performed using the background noise model. The speech utterance by the caller can be routed to a preset recognizer with the specific model(s) needed, or the necessary model(s) may be loaded into the speech recognition unit 30. In step 180 the correctness of the speech recognition is determined. In this manner then, constant monitoring and adjustment can take place while the call is in progress if necessary.

Determination of the correctness of the speech recognition in step 180 may be accomplished in several ways. If more than one speech recognition unit 30 is being used, the correct recognition of the speech utterance may be determined by using a voter scheme. That is, each speech recognition unit 30, using a set of models with different background noise characteristics, will analyze the speech utterance. A vote determines what analysis is correct. For example, if fifty recognizers determine that “Boston” has been said by the caller, and twenty recognizers determine that “Baltimore” has been said, then the system determines in step 180 that “Boston” must be the correct speech utterance. Alternatively, or in conjunction with the above method, the system can ask the caller to validate the determined speech utterance. For example, the system can prompt the caller by asking “Is this correct?”. A determination of correctness in step 180 can thus be made on a basis of most correct validations by the user and/or lowest rejections (rejections could be set high).
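The voter scheme can be sketched as a plurality count over the hypotheses returned by the recognizers. The function name `vote` and the returned vote share are assumptions added for illustration; the description only requires that the majority answer be taken as correct.

```python
from collections import Counter

def vote(hypotheses):
    """Return the most frequent hypothesis and its share of the votes."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    counts = Counter(hypotheses)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(hypotheses)

# Fifty recognizers report "Boston" and twenty report "Baltimore".
winner, share = vote(["Boston"] * 50 + ["Baltimore"] * 20)
```

The vote share could also serve as a rough confidence value, for instance to decide whether to ask the caller for validation.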

If the minimal criteria of correctness are not met, and thus the most appropriate background noise model loaded in step 160 is determined to be an unsuitable choice, a new model can be loaded. Thus in step 185, the system returns to step 160 to load a new model, perhaps the model which was previously determined in step 140 to be the next in order. The minimal criteria of correctness may be set at any level deemed appropriate and most often will be experimentally determined on the basis of each individual system and its own separate characteristics.
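The loop through steps 160, 180 and 185 might be sketched as follows. The callables `recognize` and `is_acceptable` stand in for the speech recognition unit and the correctness criteria; both, along with the function name and the example score threshold, are placeholders assumed for this example.

```python
def recognize_with_fallback(utterance, ordered_models, recognize, is_acceptable):
    """Try each model in standby order until recognition is acceptable.

    recognize(utterance, model) and is_acceptable(result) are supplied by
    the caller; they are hypothetical stand-ins for steps 160 and 180.
    """
    for model in ordered_models:              # step 160: load the next model
        result = recognize(utterance, model)
        if is_acceptable(result):             # step 180: check correctness
            return model, result
        # step 185: criteria not met, fall through to the next standby model
    return None, None                         # every model was rejected

# Toy stand-ins: only the "airport" model produces a confident result.
def fake_recognize(utterance, model):
    good = model == "airport"
    return {"text": utterance if good else "???", "score": 0.9 if good else 0.2}

model, result = recognize_with_fallback(
    "Boston", ["city", "airport", "subway"],
    fake_recognize, lambda r: r["score"] >= 0.5,
)
```

Returning `(None, None)` when every model fails leaves the caller of this function free to decide what happens next, such as transferring the call to a human operator.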

If the determination in step 180 is that speech recognition is proceeding at an acceptable level, then the system can proceed to carry out the caller's desired functions, as shown in step 190.

As such, the present invention has many advantageous uses. For instance, the system is able to provide robust speech recognition in a variety of noisy environments. In other words, the present invention works well over a gamut of different noisy environments and is thus easy to implement. Not only that, but the speech recognition system is capable of a higher performance and a lower error rate than current systems. Even when the error rate begins to approach an unacceptable level, the present system automatically corrects itself by switching to a different model(s).

It is to be understood and expected that variations in the principles of construction and methodology herein disclosed in an embodiment may be made by one skilled in the art and it is intended that such modifications, changes, and substitutions are to be included within the scope of the present invention.

What is claimed is:
 1. A method for the robust recognition of speech in a noisy environment, comprising the steps of: receiving the speech; recording an amount of data related to a noisy environment, to yield recorded data; analyzing the recorded data; selecting a background noise model based on the recorded data, to yield a selected background noise model; and performing speech recognition with the selected background noise model.
 2. The method of claim 1, further comprising the step of: modeling a background noise in the noisy environment to create the background noise model.
 3. The method of claim 1, further comprising the step of: determining a correctness of the selected background noise model, wherein when the selected background noise model is determined to be incorrect, the method comprises loading another background noise model for use in the step of performing speech recognition.
 4. The method of claim 1, further comprising the step of: constructing a background noise database for use in analyzing the recorded data on the noisy environment.
 5. The method of claim 4, wherein the background noise database is dynamically updated for each location from which data is recorded.
 6. The method of claim 1, wherein the step of analyzing the recorded data is accomplished by using at least one of a plurality of signal information.
 7. The method of claim 1, wherein the step of analyzing the recorded data is accomplished by using a correct match percentage for a plurality of background noise models determined by an input response.
 8. The method of claim 1, wherein the step of performing speech recognition is accomplished by a recognizer.
 9. A method for improving recognition of speech subjected to noise, the method comprising the steps of: sampling a connection noise to yield sampled connection noise; searching a database for a noise model that matches the sampled connection noise to yield a matching noise model; and applying the matching noise model to a speech recognition process.
 10. The method of claim 9, wherein the connection noise comprises one of city noise, motor vehicle noise, truck noise, traffic noise, airport noise, subway train noise, cellular interference noise, channel condition noise, telephone microphone characteristics noise, cellular coding noise, and Internet network connection noise.
 11. The method of claim 9, wherein the noise model is constructed by modeling the connection noise.
 12. The method of claim 9, wherein when a speech recognition error rate is determined to be above a predetermined level, the method further comprises substituting the matching noise model by applying a second noise model.
 13. The method of claim 9, wherein a speech recognition unit is used when applying the matching noise model.
 14. A speech recognition apparatus comprising: a speech recognizer; a database having stored thereon templates of a plurality of background noises; and an identifier that identifies, via a processor, a background noise template from the plurality of background noise templates, the background noise template matching a background noise from an input signal, to yield a matching background noise template, wherein the speech recognizer recognizes speech from the input signal with reference to the matching background noise template.
 15. The speech recognition apparatus of claim 14, wherein the identifier compares hidden Markov models of the plurality of background noise templates to a hidden Markov model of the background noise from the input signal.
 16. The speech recognition apparatus of claim 14, wherein the identifier identifies a portion of the input signal that is unlikely to contain speech, to yield an identified portion, wherein the identified portion is used as the background noise.
 17. The speech recognition apparatus of claim 14, wherein the identifier, when a plurality of background noise templates match the background noise, selects a template selected in a prior iteration as the matching background noise template.
 18. The speech recognition apparatus of claim 14, further comprising: a restrictor that restricts a number of candidate templates based on geographic information associated with the input signal; a comparer that compares the background noise to the restricted candidate templates to yield a comparison; and a selector that selects the matching background noise template based on the comparison.
 19. The speech recognition apparatus of claim 14, further comprising: a restrictor that restricts a number of candidate templates based on time of day information associated with the input signal to yield restricted candidate templates; a comparer that compares the background noise to the restricted candidate templates to yield a comparison; and a selector that selects the matching background noise template based on the comparison.
 20. The speech recognition apparatus of claim 14, further comprising: a restrictor that restricts a number of candidate templates based on an identifier of a user at a location from which the input signal is captured to yield restricted candidate templates; a comparer that compares the background noise to the restricted candidate templates to yield a comparison; and a selector that selects the matching background noise template based on the comparison.
 21. The speech recognition apparatus of claim 14, further comprising a microphone to capture the input signal.
 22. The speech recognition apparatus of claim 14, further comprising a telephone to capture the input signal.
 23. A speech recognition apparatus comprising: a database having stored thereon templates of a plurality of background noises; and a controller that identifies a background noise template, from the templates of the plurality of background noise templates, that matches background noise from a received input signal, to yield a matching background noise template, and supplies the matching background noise template to a speech recognizer.
 24. The speech recognition apparatus of claim 23, further comprising the speech recognizer.
 25. The speech recognition apparatus of claim 23, further comprising a microphone to capture the input signal.
 26. The speech recognition apparatus of claim 23, further comprising a telephone to capture the input signal.
 27. A method comprising: sampling a noise signal to yield a sampled noise signal; searching a database for a noise model matching the sampled noise signal to yield a matching noise model; and applying the matching noise model to a speech recognition process.
 28. The method of claim 27, wherein the searching comprises comparing hidden Markov models in the database to a hidden Markov model of the sampled noise signal.
 29. The method of claim 27, further comprising, prior to the sampling, isolating the noise signal from an input signal.
 30. The method of claim 27, further comprising, when a plurality of stored noise models match the sampled noise signal, selecting one of the plurality of stored noise models as the matching noise model according to a selection made in a prior iteration.
 31. The method of claim 27, wherein the searching comprises: restricting a set of candidate noise models based on geographic information associated with the sampled noise signal, to yield a restricted set of candidate noise models; comparing the sampled noise signal to the restricted set of candidate noise models, to yield a comparison; and selecting the matching noise model based on the comparison.
 32. The method of claim 27, wherein the searching comprises: restricting a set of candidate noise models based on time of day information associated with the sampled noise signal, to yield a restricted set of candidate noise models; comparing the sampled noise signal to the restricted set of candidate noise models, to yield a comparison; and selecting the matching noise model based on the comparison.
 33. The method of claim 27, wherein the searching comprises: restricting a set of candidate noise models based on an identifier of a user at a location from which the sampled noise signal is captured, to yield a restricted set of candidate noise models; comparing the sampled noise signal to the restricted set of candidate noise models, to yield a comparison; and selecting the matching noise model based on the comparison.
 34. A speech recognition method, comprising: identifying a background noise component from an input signal; comparing the background noise component to a plurality of previously-stored noise models, to yield a comparison; selecting a noise model from the plurality of previously-stored noise models based on the comparison, to yield a selected noise model; and performing speech recognition on the input signal with reference to the selected noise model.
 35. The speech recognition method of claim 34, further comprising: identifying a subsequent background noise component from the input signal; comparing the subsequent background noise component to the plurality of previously-stored noise models, to yield a second comparison; selecting a second noise model from the plurality of previously-stored noise models based on the second comparison, to yield a second selected noise model; and performing speech recognition on the input signal with reference to second selected noise model.
 36. The speech recognition method of claim 34, further comprising: when speech recognition fails, selecting a second noise model from the plurality of previously-stored noise models based on the second comparison, to yield a second selected noise model; and performing speech recognition on the input signal with reference to the second selected noise model.
 37. The speech recognition method of claim 34, further comprising, wherein the identifying occurs while prompting a user with an introductory message.
 38. The speech recognition method of claim 34, wherein the comparing uses hidden Markov models of the plurality of previously-stored noise models and a hidden Markov model of the background noise component.
 39. The speech recognition method of claim 34, further comprising, when a plurality of noise models from the plurality of previously-stored noise models match the background noise component, selecting one of the plurality of previously-stored noise models as a most closely matching noise model according to a selection made in a prior iteration.
 40. The speech recognition method of claim 34, wherein the comparing and selecting comprise: restricting a set of candidate noise models based on geographic information associated with the background noise component, to yield a restricted set of candidate noise models; comparing the background noise component to the restricted set of candidate noise models, to yield a second comparison; and selecting the matching noise model based on the second comparison.
 41. The speech recognition method of claim 34, wherein the comparing and selecting comprise: restricting a set of candidate noise models based on time of day information associated with the background noise component, to yield a restricted set of candidate noise models; comparing the background noise component to the restricted set of candidate noise models, to yield a second comparison; and selecting the matching noise model based on the second comparison.
 42. The speech recognition method of claim 34, wherein the comparing and selection comprise: restricting a set of candidate noise models based on an identifier of a user at a location from which the input signal is captured, to yield a restricted set of candidate noise models; comparing the background noise component to the restricted set of candidate noise models, to yield a second comparison; and selecting a closely matching noise model based on the second comparison.